Add RoPE embeddings by samanklesaria · Pull Request #5481 · google/flax

samanklesaria · 2026-06-02T17:25:27Z

This PR adds support for RoPE. Specifically, a new function dot_product_attention_with_rope can be used as the attention_fn argument for nnx.MultiHeadAttention.

vfdev-5 · 2026-06-02T20:00:55Z

+    self.cos_cached = nnx.Variable(jnp.cos(freqs_outer).astype(dtype))
+    self.sin_cached = nnx.Variable(jnp.sin(freqs_outer).astype(dtype))
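For readers without the full diff, here is a minimal sketch of how cached tables like these could be built. The helper name build_rope_cache, the base of 10000, and the half-dimension layout are assumptions for illustration, not necessarily what the PR does.

```python
import jax.numpy as jnp


def build_rope_cache(max_seq_len, embedding_size, base=10000.0, dtype=jnp.float32):
    """Hypothetical helper: precompute cos/sin tables for positions 0..max_seq_len-1."""
    # One inverse frequency per pair of features (assumes embedding_size is even).
    exponents = jnp.arange(0, embedding_size, 2, dtype=jnp.float32) / embedding_size
    inv_freq = 1.0 / (base ** exponents)
    positions = jnp.arange(max_seq_len, dtype=jnp.float32)
    # Outer product: angle for every (position, frequency) combination.
    freqs_outer = jnp.outer(positions, inv_freq)  # (max_seq_len, embedding_size // 2)
    return jnp.cos(freqs_outer).astype(dtype), jnp.sin(freqs_outer).astype(dtype)


cos_cached, sin_cached = build_rope_cache(max_seq_len=16, embedding_size=8)
```

Because the tables are indexed by the plain row number, they implicitly assume that token i sits at position i, which is the assumption questioned in the review comment that follows.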


In several JAX repositories I have seen sin and cos constructed from segment_pos (absolute token positions):

our gemma example

bonsai and jax-llm-examples

gemma4

If the input sequence is only padded, not packed, our implementation would mostly work as expected. With a packed sequence, where multiple sentences are placed in the same sequence:

[<bos>, 1, 2, 3, <eos>, <bos>, 4, 5, <eos>, <pad>, <pad>]

then the input positions (segment_pos) would be:

[0, 1, 2, 3, 4, 0, 1, 2, 3, 0, 0]

so RoPE would need to be computed from these positions rather than from the plain index within the sequence.
On the other hand, the current MHA.__call__ does not accept a positions argument, so we can't pass it to RoPE...
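As an illustration of the packed example above, the segment_pos vector could be derived from per-token segment ids. This is only a sketch: the function name positions_from_segment_ids and the convention that segment id 0 marks padding are assumptions, not something the referenced repositories are confirmed to use.

```python
import jax
import jax.numpy as jnp


def positions_from_segment_ids(segment_ids):
    """Hypothetical helper: restart positions at 0 whenever the segment id changes.

    segment_ids: (S,) integer array; 0 is assumed to mark padding.
    """
    idx = jnp.arange(segment_ids.shape[0])
    # A token starts a segment if it is first or its id differs from the previous one.
    is_start = jnp.concatenate(
        [jnp.array([True]), segment_ids[1:] != segment_ids[:-1]]
    )
    # Running maximum of the start indices gives the start index of each token's segment.
    start_idx = jax.lax.cummax(jnp.where(is_start, idx, 0), axis=0)
    positions = idx - start_idx
    # Padding tokens get position 0, as in the example above.
    return jnp.where(segment_ids == 0, 0, positions)


# [<bos>, 1, 2, 3, <eos>, <bos>, 4, 5, <eos>, <pad>, <pad>]
segment_ids = jnp.array([1, 1, 1, 1, 1, 2, 2, 2, 2, 0, 0])
segment_pos = positions_from_segment_ids(segment_ids)
```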

In PyTorch, the basic implementation does something similar to yours, but cos and sin are cached based on the input x.

My implementation was based on the one in equinox, which doesn't pass explicit positions.

We could give MHA a new optional positions argument which we thread through to the attention_fn. But this could break users' custom attention_fn implementations if they weren't expecting the argument.

We could make an MHA subclass with a call method that accepts a positions argument. Say, PackedMHA? This is a little uglier, but it wouldn't be a breaking change.

Can we do the following?

class MHA:
    def __call__(self, ..., input_positions: Array | None = None):
        ...
        attn_kwargs = {}
        if input_positions is not None:
            attn_kwargs["input_positions"] = input_positions
        x = self.attention_fn(
            query,
            key,
            value,
            mask=mask,
            dropout_rng=dropout_rng,
            dropout_rate=self.dropout_rate,
            broadcast_dropout=self.broadcast_dropout,
            deterministic=deterministic,
            dtype=self.dtype,
            precision=self.precision,
            module=self if sow_weights else None,
            is_causal=is_causal,
            **attn_kwargs,
        )

def dot_product_attention_with_rope(..., rope, input_positions: Array | None = None, **kwargs):
    # handle properly input_positions is None
    # input_positions: (B, S)
    apply = jax.vmap(rope, in_axes=(-2, -1), out_axes=(-2, -1))
    query = apply(query, input_positions)
    key = apply(key, input_positions)
    ...

I like that! So users that aren't already calling MHA with input positions won't have their custom attention_fn break!
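To make the backward-compatibility argument concrete, here is a small self-contained sketch of the conditional forwarding. call_attention is a hypothetical stand-in for the attention call inside MHA.__call__, and both attention functions are illustrative, not Flax APIs.

```python
import jax
import jax.numpy as jnp


def call_attention(attention_fn, query, key, value, input_positions=None):
    """Hypothetical stand-in for the attention call inside MHA.__call__."""
    attn_kwargs = {}
    # Only forward the new keyword when the caller actually supplied positions.
    if input_positions is not None:
        attn_kwargs["input_positions"] = input_positions
    return attention_fn(query, key, value, **attn_kwargs)


def legacy_attention_fn(query, key, value):
    """A custom attention_fn written before input_positions existed."""
    scores = jnp.einsum("bqhd,bkhd->bhqk", query, key) / jnp.sqrt(query.shape[-1])
    weights = jax.nn.softmax(scores, axis=-1)
    return jnp.einsum("bhqk,bkhd->bqhd", weights, value)


def position_aware_attention_fn(query, key, value, input_positions=None):
    """Accepts positions; a real implementation would rotate query and key here."""
    return legacy_attention_fn(query, key, value)


q = k = v = jnp.ones((1, 4, 2, 8))  # (batch, seq, heads, head_dim)
positions = jnp.arange(4)[None, :]  # (batch, seq)

# Existing call sites keep working with an attention_fn that knows nothing about positions.
out_legacy = call_attention(legacy_attention_fn, q, k, v)
# Callers that pass positions need an attention_fn that accepts them.
out_aware = call_attention(position_aware_attention_fn, q, k, v, input_positions=positions)
```

Passing input_positions together with legacy_attention_fn would still raise a TypeError, but only for callers who opt in to the new argument.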

Should we still keep the cached sin and cos vectors for use when input_positions=None? Might be slightly faster for the non-packed case, as we wouldn't need to rebuild them. But the interface could be nicer if we just rebuilt them every time, so that the RoPE constructor wouldn't need max_seq_len and embedding_size arguments (dynamically getting them from the input x). What do you think?

Can we cache them after the first call?

So we'd check if the input_positions is the same as the cached one at each call. If so, we'd use the cache (populated on the first call) and otherwise, we'd generate it dynamically?

To work in tree mode, we'd need to create the Variables for the cache at initialization. But then we could write to these variables on the first __call__.

If we're doing cross attention, things get trickier. The sequence lengths for the queries might not be the same as the lengths for the keys. Caching a single value wouldn't account for this. Caching everything would break if the packing differs from batch to batch.

Good point. Actually, I'm not sure I have seen RoPE used for cross-attention, but it makes sense that we may need q_positions and kv_positions, or even k_positions and v_positions. But then the API can become rather cumbersome.
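A rough sketch of what separate query and key positions could look like for cross-attention. The names rope and cross_attention_scores, the split-in-half rotation layout, and the base of 10000 are all assumptions for illustration; this is not the PR's API.

```python
import jax
import jax.numpy as jnp


def rope(x, positions, base=10000.0):
    """Rotate feature pairs of x by position-dependent angles.

    x: (S, D) with even D; positions: (S,) integer positions.
    """
    d = x.shape[-1]
    inv_freq = 1.0 / (base ** (jnp.arange(0, d, 2) / d))
    angles = positions[:, None] * inv_freq  # (S, D // 2)
    cos, sin = jnp.cos(angles), jnp.sin(angles)
    x1, x2 = jnp.split(x, 2, axis=-1)
    return jnp.concatenate([x1 * cos - x2 * sin, x2 * cos + x1 * sin], axis=-1)


def cross_attention_scores(query, key, q_positions, kv_positions):
    """Hypothetical helper: queries and keys have different lengths and positions."""
    return rope(query, q_positions) @ rope(key, kv_positions).T


key_q, key_k = jax.random.split(jax.random.PRNGKey(0))
query = jax.random.normal(key_q, (3, 8))  # 3 query tokens
keys = jax.random.normal(key_k, (5, 8))   # 5 key tokens

scores = cross_attention_scores(query, keys, jnp.arange(3), jnp.arange(5))
# Shifting both position vectors by the same offset leaves the scores unchanged,
# since RoPE scores depend only on the relative offset between query and key.
shifted = cross_attention_scores(query, keys, jnp.arange(3) + 7, jnp.arange(5) + 7)
```

Only queries and keys are rotated in this sketch, so a single kv_positions argument covers the key side.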

I ended up at a compromise API. If the user wants caching, they can specify the max size during initialization, like I had before. Otherwise, we don't cache, and compute the rotation matrices on the fly.
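A minimal sketch of what such a compromise could look like, written as a plain Python class rather than an nnx.Module. The class name RoPESketch, its argument names, and the rotation layout are assumptions for illustration and may differ from the code in this PR.

```python
import jax.numpy as jnp


class RoPESketch:
    """Hypothetical RoPE helper: optional cache, otherwise computed on the fly."""

    def __init__(self, max_seq_len=None, embedding_size=None, base=10000.0):
        self.base = base
        self.cache = None
        if max_seq_len is not None:
            if embedding_size is None:
                raise ValueError("embedding_size is required when caching")
            # Precompute cos/sin for every position below max_seq_len.
            self.cache = self._tables(jnp.arange(max_seq_len), embedding_size)

    def _tables(self, positions, embedding_size):
        exponents = jnp.arange(0, embedding_size, 2) / embedding_size
        angles = positions[..., None] / (self.base ** exponents)
        return jnp.cos(angles), jnp.sin(angles)

    def __call__(self, x, positions=None):
        # x: (S, D) with even D; positions default to 0..S-1 (unpacked input).
        seq_len, d = x.shape
        if positions is None:
            positions = jnp.arange(seq_len)
        if self.cache is not None:
            # Look up the rows for the requested positions in the cached tables.
            cos, sin = self.cache[0][positions], self.cache[1][positions]
        else:
            cos, sin = self._tables(positions, d)
        x1, x2 = jnp.split(x, 2, axis=-1)
        return jnp.concatenate([x1 * cos - x2 * sin, x2 * cos + x1 * sin], axis=-1)


x = jnp.ones((4, 8))
packed_positions = jnp.array([0, 1, 0, 1])  # two packed segments of length 2

cached = RoPESketch(max_seq_len=16, embedding_size=8)
uncached = RoPESketch()
out_cached = cached(x, packed_positions)
out_uncached = uncached(x, packed_positions)
```

In this sketch the cached variant only needs a gather at call time, while the uncached variant needs no size arguments at construction.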

This also means we don't have to use QDD for the cache, which might become deprecated if we end up switching to hijax variables and QDD is removed. One less thing to worry about down the road.

samanklesaria force-pushed the rope branch from 922c76c to 94997d5 (June 2, 2026 17:48)

samanklesaria requested a review from vfdev-5 (June 2, 2026 17:51)

samanklesaria force-pushed the rope branch from 94997d5 to 859b726 (June 2, 2026 18:31)

vfdev-5 reviewed (Jun 2, 2026)

samanklesaria force-pushed the rope branch 2 times, most recently from 66c99da to 319550d (June 3, 2026 18:00)

Commit ea04909: Add RoPE embeddings

samanklesaria force-pushed the rope branch from 319550d to ea04909 (June 3, 2026 18:04)

samanklesaria requested a review from vfdev-5 (June 3, 2026 18:06)

samanklesaria wants to merge 1 commit into google:main from samanklesaria:rope
